A novel alignment model inspired on IBM Model 1

Authors

  • Jesús González-Rubio
  • Germán Sanchis-Trilles
  • Alfons Juan
  • Francisco Casacuberta
Abstract

We present an extension to IBM Model 1 for training word-to-word lexicon probabilities. This model takes into account a given fixed segmentation of the source and target sentences in the estimation of the statistical dictionary. Our experiments on the Europarl corpus show that a statistically consistent improvement in translation quality can be achieved by including our proposed model as a new information source in a log-linear combination of models.

1 Statistical Machine Translation

The goal of Machine Translation is to translate a text given in some source language into a target language. We are given a source language sentence $f = f_1 \dots f_j \dots f_J$, which is to be translated into a target language sentence. Among all possible target language sentences, we choose the sentence $\hat{e} = e_1 \dots e_i \dots e_I$ that maximises the posterior probability. This statement is formalised in the Fundamental Equation of Machine Translation:

\[ \hat{e} = \operatorname*{argmax}_{e} \{\Pr(e \mid f)\} = \operatorname*{argmax}_{e} \{\Pr(e) \cdot \Pr(f \mid e)\} \tag{1} \]

The argmax operation denotes the search problem, i.e. the generation of the output sentence in the target language. The decomposition in Eq. (1) allows the target language model $\Pr(e)$ and the (inverse) translation model $\Pr(f \mid e)$ to be modelled independently, an approach known as the source-channel model [1]. This decomposition has a very intuitive interpretation: the translation model $\Pr(f \mid e)$ captures the word relations between the input and output languages, whereas the language model $\Pr(e)$ ensures that the output is a well-formed sentence of the target language.

Many statistical translation models [2–5] try to model word-to-word correspondences between source and target words. Known as statistical alignment models, these models typically yield the following equation:
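As background for the lexicon probabilities mentioned in the abstract, the following is a minimal sketch of EM training for standard IBM Model 1, which estimates a word-to-word table t(f|e). It is not the segmentation-aware extension proposed in the paper, and it is not the alignment-model equation that the text above introduces. The function name `train_ibm1`, the `NULL` token string, the iteration count, and the three-sentence toy corpus are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(f|e) with EM for standard IBM Model 1 (illustrative sketch).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict-like table keyed by (f, e).
    """
    src_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(src_vocab)
    # Start from a uniform table over the source vocabulary.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links leaving e

        # E-step: distribute each source word over the target words
        # of its sentence, in proportion to the current t(f|e).
        for fs, es in pairs:
            es = ["NULL"] + list(es)  # assumed empty-word token
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c

        # M-step: renormalise so that t(.|e) sums to one for each e.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Hypothetical toy corpus: German as source f, English as target e.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
table = train_ibm1(corpus)
for f in ("das", "haus"):
    print(f, "| house:", round(table[(f, "house")], 3))
```

Because Model 1 ignores word positions, the table is driven only by co-occurrence: a word such as "das" that can be explained by "the" in several sentences gradually gives up its probability mass under "house".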

Similar resources

A New Approach for English-Chinese Named Entity Alignment

Traditional word alignment approaches cannot come up with satisfactory results for Named Entities. In this paper, we propose a novel approach using a maximum entropy model for named entity alignment. To ease the training of the maximum entropy model, bootstrapping is used to help supervised learning. Unlike previous work reported in the literature, our work conducts bilingual Named Entity align...

Inner-Outer Bracket Models for Word Alignment using Hidden Blocks

Most statistical translation systems are based on phrase translation pairs, or “blocks”, which are obtained mainly from word alignment. We use blocks to infer better word alignment and improved word alignment which, in turn, leads to better inference of blocks. We propose two new probabilistic models based on the inner-outer segmentations and use EM algorithms for estimating the models’ paramete...

Hidden Markov Tree Model for Word Alignment

We propose a novel unsupervised word alignment model based on the Hidden Markov Tree (HMT) model. Our model assumes that the alignment variables have a tree structure which is isomorphic to the target dependency tree and models the distortion probability based on the source dependency tree, thereby incorporating the syntactic structure from both sides of the parallel sentences. In English-Japan...

Improving IBM Word Alignment Model 1

We investigate a number of simple methods for improving the word-alignment accuracy of IBM Model 1. We demonstrate reduction in alignment error rate of approximately 30% resulting from (1) giving extra weight to the probability of alignment to the null word, (2) smoothing probability estimates for rare words, and (3) using a simple heuristic estimation method to initialize, or replace, EM train...

Simultaneous Word-Morpheme Alignment for Statistical Machine Translation

Current word alignment models for statistical machine translation do not address morphology beyond merely splitting words. We present a two-level alignment model that distinguishes between words and morphemes, in which we embed an IBM Model 1 inside an HMM based word alignment model. The model jointly induces word and morpheme alignments using an EM algorithm. We evaluated our model on Turkish-...

Publication date: 2008